Welcome to the Springboard Regression case study! Please note: this is Tier 3 of the case study.
This case study was designed for you to use Python to apply the knowledge you've acquired in reading The Art of Statistics (hereinafter AoS) by Professor Spiegelhalter. Specifically, the case study will get you doing regression analysis, a method discussed in Chapter 5 on p.121. It might be useful to have the book open at that page while doing the case study to remind you of what we're up to (but bear in mind that other statistical concepts, such as training and testing, will be applied, so you might have to glance at other chapters too).
The aim is to *use exploratory data analysis (EDA) and regression to predict fixed acidity levels in wine with a model that's as accurate as possible*.
We'll try a univariate analysis (one involving a single explanatory variable) as well as a multivariate one (involving multiple explanatory variables), and we'll iterate together towards a decent model by the end of the notebook. The main thing is for you to see how regression analysis looks in Python and Jupyter, and to get some practice implementing this analysis.
Throughout this case study, questions will be asked in the markdown cells. Try to answer these yourself in a simple text file when they come up. Most of the time, the answers will become clear as you progress through the notebook. Some of the answers may require a little research with Google and other basic resources available to every data scientist.
For this notebook, we're going to use the red wine dataset, wineQualityReds.csv. Make sure it's downloaded and sitting in your working directory. This is a very common dataset for practicing regression analysis and is freely available on Kaggle.
You're pretty familiar with the data science pipeline at this point. This project will have the following structure:
1. Sourcing and loading
2. Cleaning, transforming, and visualizing
3. Modeling
4. Evaluating and concluding
# Import relevant libraries and packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # For all our visualization needs.
import statsmodels.api as sm # Classical statistical models, such as OLS, with rich summary output.
from statsmodels.graphics.api import abline_plot # Plots a line with a given intercept and slope on a matplotlib axis.
from sklearn.metrics import mean_squared_error, r2_score # Metrics for evaluating regression models.
from sklearn.model_selection import train_test_split # Splits data into random train and test subsets.
from sklearn import linear_model, preprocessing # Linear models (e.g. LinearRegression) and data-preprocessing utilities.
import warnings # For handling error messages.
# Don't worry about the following two instructions: they just suppress warnings that could occur later.
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
# Load the data.
df = pd.read_csv('wineQualityReds.csv',index_col=0)
# Check out its appearance.
df.head()
# Another very useful method to call on a recently imported dataset is .info().
# Call it here to get a good overview of the data.
df.info()
What can you infer about the nature of these variables, as output by the info() method?
Which variables might be suitable for regression analysis, and why? For those variables that aren't suitable for regression analysis, is there another type of statistical modeling for which they are suitable?
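One quick, hands-on way to explore these questions (just a sketch using standard pandas methods) is to look at each column's dtype and its number of distinct values:

# Continuous float columns are natural regression targets; an integer column
# with only a handful of distinct values (like quality) looks more like a set of categories.
print(df.dtypes)
print(df.nunique())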
# We should also look more closely at the dimensions of the dataset.
df.shape
We now need to pick a dependent variable for our regression analysis: a variable whose values we will predict.
'Quality' seems to be as good a candidate as any. Let's check it out. One of the quickest and most informative ways to understand a variable is to make a histogram of it. This gives us an idea of both the center and spread of its values.
# Making a histogram of the quality variable.
df.quality.value_counts(sort=False).plot(kind='bar');
We can see so much about the quality variable just from this simple visualization. Ask yourself: what value do most wines have for quality? What are the minimum and maximum quality values, and what is the range? Remind yourself of these summary statistical concepts by looking at p.49 of the AoS.
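If you'd like to check your answers numerically, here's a quick sketch using pandas:

# Most common quality value, then the minimum, maximum, and range.
print(df.quality.mode()[0])
print(df.quality.min(), df.quality.max())
print(df.quality.max() - df.quality.min())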
But can you think of a problem with making this variable the dependent variable of regression analysis? Remember the example in AoS on p.122 of predicting the heights of children from the heights of parents? Take a moment here to think about potential problems before reading on.
The issue is this: quality is a discrete variable, in that its values are integers (whole numbers) rather than floating-point numbers, so quality is not a continuous variable. That makes it a poor target for regression analysis.
Before we dismiss the quality variable, however, let's verify that it is indeed a discrete variable with some further exploration.
# Get a basic statistical summary of the variables, including quality
df.describe().T
# What do you notice from this summary?
# Get a list of the values of the quality variable, and the number of occurrences of each.
df.quality.value_counts(sort=False)
The outputs of the describe() and value_counts() methods are consistent with our histogram, and since the counts sum to the number of rows in the dataset, we can infer that there are no NAs in the quality variable.
But scroll up to where we called info() on our wine dataset. We could already have seen there that the quality variable has int64 as its type, so we had sufficient information to know that quality was not appropriate for regression analysis. Did you figure this out yourself? If so, kudos to you!
The quality variable would, however, be well suited to classification analysis. This is because, while the values of the quality variable are numeric, those discrete values represent categories, and predicting category membership is most often best done by classification algorithms. You saw the decision tree output by running a classification algorithm on the Titanic dataset on p.168 of Chapter 6 of AoS.
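As a taste of what that would look like (a minimal sketch, not part of this case study), here is how quality could be handed to a decision tree classifier like the one in AoS. Note that scoring on the training data, as here, only demonstrates the mechanics, not real performance:

from sklearn.tree import DecisionTreeClassifier

# Fit a shallow decision tree with quality as the categorical target.
clf = DecisionTreeClassifier(max_depth=3, random_state=123)
clf.fit(df.drop(columns='quality'), df['quality'])
print(clf.score(df.drop(columns='quality'), df['quality']))  # training accuracy only

For now, though, we'll continue with our regression analysis and our search for a suitable dependent variable.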
Now, since the rest of the variables in our wine dataset are continuous, we could, in theory, pick any of them. But that does not mean they are all equally suitable choices. What counts as a suitable dependent variable for regression analysis is determined not just by intrinsic features of the dataset (such as data types, number of NAs, etc.) but by extrinsic features, such as, simply, which variables are the most interesting or useful to predict, given our aims and values in the context we're in. Almost always, we can only determine which variables are sensible choices for dependent variables with some domain knowledge.
Not all of you might be wine buffs, but one very important and interesting quality in wine is acidity. As the Waterhouse Lab at the University of California explains, 'acids impart the sourness or tartness that is a fundamental feature in wine taste. Wines lacking in acid are "flat." Chemically the acids influence titratable acidity which affects taste and pH which affects color, stability to oxidation, and consequently the overall lifespan of a wine.'
Since quality won't do, fixed acidity seems like a great option for a dependent variable. Let's go with that.
So, with fixed acidity as our dependent variable, what we now want is an idea of which variables are interestingly related to it.
We can call the .corr() method on our wine data to look at all the correlations between our variables. As the documentation shows, the default correlation coefficient is the Pearson correlation coefficient (p.58 and p.396 of the AoS); but other coefficients can be plugged in as parameters. Remember, the Pearson correlation coefficient shows us how close to a straight line the data-points fall, and is a number between -1 and 1.
# Call the .corr() method on the wine dataset
df.corr()
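The full matrix is a lot to take in at once. To focus on our chosen dependent variable, we can pull out just its column and sort it; and, as the documentation notes, other coefficients can be swapped in via the method parameter:

# Correlations with fixed.acidity, strongest positive first.
df.corr()['fixed.acidity'].sort_values(ascending=False)
# The same using Spearman's rank correlation instead of Pearson's:
# df.corr(method='spearman')['fixed.acidity'].sort_values(ascending=False)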
OK, you might be thinking, but wouldn't it be nice if we visualized these relationships? It's hard to get a picture of the correlations between the variables without something visual.
Very true, and this brings us to the next section.
The heading of this stage of the data science pipeline ('Cleaning, Transforming, and Visualizing') doesn't imply that we have to do all of those operations in that order. Sometimes (and this is a case in point) our data is already relatively clean, and the priority is to do some visualization. Normally, however, our data is less sterile, and we have to do some cleaning and transforming first prior to visualizing.
Now that we've chosen fixed acidity as our dependent variable for regression analysis, we can begin by plotting the pairwise relationships in the dataset, to check out how our variables relate to one another.
# Make a pairplot of the wine data
sns.pairplot(df);
If you've never executed your own Seaborn pairplot before, take a moment to look at the output. Pairplots certainly convey a lot of information at once. What can you infer from it? What can you not justifiably infer from it?
... All done?
Here are a couple of things you might have noticed: the .corr() output gives precise numbers but no visual feel for the relationships, while the pairplot shows every pairwise relationship but makes it hard to judge correlation strength at a glance.
So we have now called both the .corr() method and Seaborn's .pairplot() on our wine data. Both have flaws. Happily, we can get the best of both worlds with a heatmap.
# Make a heatmap of the data
sns.heatmap(df.corr(), cmap='coolwarm', vmax=1,vmin=-1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .7},
annot=True,
fmt=".1f"
);
There is a relatively strong correlation between the density and fixed.acidity variables. In the next code block, call the scatterplot() method on our sns object. Make the x parameter 'density', the y parameter 'fixed.acidity', and the data parameter our wine dataset.
# Plot density against fixed.acidity
sns.scatterplot(data=df,x='density',y='fixed.acidity',alpha=0.4);
We can see a positive correlation, and quite a steep one. There are some outliers, but on the whole the points cluster along a steep line that looks like it ought to be drawn through them.
# Call the regplot method on your sns object, with parameters: x = 'density', y = 'fixed.acidity'
sns.regplot(data=df,x='density',y='fixed.acidity', marker='+');
The line of best fit matches the overall shape of the data, but it's clear that there are some points that deviate from the line, rather than all clustering close.
Let's see if we can predict fixed acidity based on density using linear regression.
While this dataset is super clean, and hence doesn't require much for analysis, we still need to split our dataset into a test set and a training set.
You'll recall from p.158 of AoS that such a split is important good practice when evaluating statistical models. On p.158, Professor Spiegelhalter was evaluating a classification tree, but the same applies when we're doing regression. A common practice is to train on 75% of the data and test on the remaining 25%.
For our first model, we're only going to focus on two variables: fixed acidity as our dependent variable, and density as our sole independent (predictor) variable.
We'll be using sklearn here. Don't worry if not all of the syntax makes sense; just follow the rationale for what we're doing.
# Subsetting our data into our dependent and independent variables.
y = df['fixed.acidity']
X = df[['density']]
# Split the data. This line uses the sklearn function train_test_split().
# The test_size parameter means we can train with 75% of the data, and test on 25%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
# Check the shapes of X_train, y_train, X_test, and y_test to make sure the proportions are right.
for subset in [X_train, y_train, X_test, y_test]:
    print(subset.shape)
Sklearn has a LinearRegression() class in its linear_model module. We'll be using that to make our regression model.
# Create the model: instantiate linear_model.LinearRegression() and assign it to a variable (here, lr).
lr = linear_model.LinearRegression()
# We now want to train (fit) the model on our training data.
lr.fit(X_train,y_train)
# Evaluate the model: score() returns R-squared for the data we pass in (here, the training set).
lr.score(X_train, y_train)
The above score is called the R-squared coefficient, or the "coefficient of determination". It measures how successfully our model captures the variation of the data around the mean: 1 would mean a perfect model that explains 100% of the variation. At the moment, our model explains only about 45% of the variation around the mean. There's more work to do!
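If you want to convince yourself of what score() is doing, here's a minimal sketch that computes R-squared by hand on the training data (it should match the value above):

# R-squared = 1 - (sum of squared residuals) / (total sum of squares around the mean).
y_pred_train = lr.predict(X_train)
ss_res = ((y_train - y_pred_train) ** 2).sum()
ss_tot = ((y_train - y_train.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)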
# Use the model to make predictions about our test data.
predictions = lr.predict(X_test)
# Plot the predictions against the actual results, using scatter().
plt.scatter(y_test, predictions, marker='o', alpha=0.2)
# Add a line for perfect correlation (predicted == actual).
plt.plot(y_test, y_test, '-r', label='Y Test')
plt.xlabel('Y Test')
plt.ylabel('1st Predicted Y')
plt.legend();
The above scatterplot shows how well the predictions match the actual results.
Along the x-axis we have the actual fixed acidity, and along the y-axis the predicted value for the fixed acidity.
There is a visible positive correlation, so the model has not been totally unsuccessful, but it's clearly not maximally accurate: wines with an actual fixed acidity of just over 10 have been predicted as having acidity levels anywhere from about 6.3 to 13.
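Note that the score above was computed on the training data. Since we imported r2_score earlier, we can also check how the model does on the held-out test set (a quick extra check, not part of the original flow):

# R-squared on the test set, using the predictions we just made.
print(r2_score(y_test, predictions))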
Let's build a similar model using a different package, to see if we get a better result that way.
# Create the test and train sets. Here, we do things slightly differently.
# We make the explanatory variable X as before.
X = df[['density']]
# Here, we reassign X by adding a constant column (a column of 1s) to it. Statsmodels' OLS
# needs this extra column in order to fit an intercept term.
# Further explanation can be found here:
# https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html
X = sm.add_constant(X)
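# Peek at X: add_constant has prepended a 'const' column of 1.0s alongside density.
print(X.head())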
# The rest of the preparation is as before.
y = df['fixed.acidity']
# Split the data using train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
# Create the model
ols = sm.OLS(y_train, X_train)
# Fit the model with fit()
results = ols.fit()
# Evaluate the model with .summary()
results.summary()
One of the great things about Statsmodels (sm) is that you get so much information from the summary() method.
There are lots of values here, whose meanings you can explore at your leisure, but here's one of the most important: the R-squared score is 0.455, the same as with the previous model. This makes perfect sense: it's the same value as the score from sklearn, because both fit the same algorithm to the same data.
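If you only need a particular number rather than the whole table, the fitted results object exposes the summary values as attributes, for example:

# R-squared and the fitted coefficients (intercept 'const' and slope for density).
print(results.rsquared)
print(results.params)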
Here's a useful link you can check out if you have the time: https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/
# Let's use our new model to make predictions of the dependent variable y. Use predict(), and plug in X_test as the parameter
predictions2 = results.predict(X_test)
# Plot the predictions
# Build a scatterplot
plt.scatter(y_test, predictions2, marker='o',alpha=0.2)
# Add a line for perfect correlation. Can you see what this line is doing? Use plot()
plt.plot(y_test,y_test,'-r', label='Y Test')
# Label it nicely
plt.xlabel('Y Test')
plt.ylabel('2nd Predicted Y');
plt.legend();
The red line shows a theoretically perfect correlation between our actual and predicted values - the line that would exist if every prediction was completely correct. It's clear that while our points have a generally similar direction, they don't match the red line at all; we still have more work to do.
To get a better predictive model, we should use more than one variable.
Remember, as Professor Spiegelhalter explains on p.132 of AoS, including more than one explanatory variable in a linear regression analysis is known as multiple linear regression.
# Create test and train datasets
# This is again very similar, but now we include more columns in the predictors
# Include all columns in the explanatory variables X except fixed.acidity and quality (which, as we saw, is discrete).
X = df.drop(["fixed.acidity", "quality"],axis=1)
# Add a constant column to X so the model can fit an intercept.
X = sm.add_constant(X)
y = df[["fixed.acidity"]]
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)
# We can use almost identical code to create the third model, because it is the same algorithm, just different inputs
# Create the model
ols3 = sm.OLS(y_train, X_train)
# Fit the model
results3 = ols3.fit()
# Evaluate the model
results3.summary()
The R-squared score shows a big improvement: our first model explained only around 45% of the variation, but now we're explaining 87%!
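One caveat worth knowing: adding predictors can only increase plain R-squared, so when comparing models with different numbers of variables it's worth glancing at the adjusted R-squared too, which penalizes extra predictors:

# Plain vs adjusted R-squared for the multiple regression model.
print(results3.rsquared, results3.rsquared_adj)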
# Use our new model to make predictions
predictions3 = results3.predict(X_test)
# Plot the predictions
# Build a scatterplot
plt.scatter(y_test, predictions3, marker='o',alpha=0.2)
# Add a line for perfect correlation. Can you see what this line is doing? Use plot()
plt.plot(y_test,y_test,'-r', label='Y Test')
# Label it nicely
plt.xlabel('Y Test')
plt.ylabel('3rd Predicted Y');
plt.legend();
We've now got a much closer match between our data and our predictions, and we can see that the shape of the data points is much more similar to the red line.
We can check another metric as well: the RMSE (Root Mean Squared Error). The MSE is defined by Professor Spiegelhalter on p.393 of AoS, and the RMSE is just the square root of that value. It is a measure of the accuracy of a regression model: the square root of the average squared difference between predictions and actual values. Check out p.163 of AoS for a reminder of how this works.
# Equivalently, using sklearn's metrics module:
# from sklearn import metrics
# print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions3)))

# Define a function to check the RMSE. Remember the def keyword needed to make functions?
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
# Get predictions from the third model (we already computed these above as predictions3).
# Put the predictions & actual values into a dataframe
matches = pd.DataFrame(y_test)
matches.rename(columns = {'fixed.acidity':'actual'}, inplace=True)
matches["predicted"] = predictions3
rmse(matches["actual"], matches["predicted"])
The RMSE tells us how far, on average, our predictions were mistaken. An RMSE of 0 would mean we were making perfect predictions. 0.6 signifies that we are, on average, about 0.6 of a unit of fixed acidity away from the correct answer. That's not bad at all.
We can also see from our earlier heatmap that volatile.acidity and citric.acid are both correlated with pH. We can make a model that ignores those two variables and just uses pH, in an attempt to remove redundancy from our model.
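If you'd like a more systematic way to spot redundant predictors than eyeballing the heatmap, one common tool (an optional extra, not part of the original notebook) is the variance inflation factor from statsmodels:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each predictor in the full model; values well above ~5-10
# suggest a variable is largely explained by the others. (The 'const' row can be ignored.)
X_check = sm.add_constant(df.drop(['fixed.acidity', 'quality'], axis=1))
for i, col in enumerate(X_check.columns):
    print(col, variance_inflation_factor(X_check.values, i))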
df.columns
# Create test and train datasets
# Include the remaining six columns as predictors.
# free.sulfur.dioxide is also highly correlated with total.sulfur.dioxide, so we exclude it too.
# Based on the earlier pairplot, alcohol is highly correlated with density and citric.acid, so we exclude it as well.
X = df.drop(['volatile.acidity','citric.acid','free.sulfur.dioxide','fixed.acidity','quality','alcohol'],axis=1)
# Add a constant column to X so the model can fit an intercept.
X = sm.add_constant(X)
y = df['fixed.acidity']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)
# Create the fourth model
ols4 = sm.OLS(y_train, X_train)
# Fit the model
results4 = ols4.fit()
# Evaluate the model
results4.summary()
The R-squared score has dropped, showing that the columns we removed actually carried useful information.
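To quantify the trade-off, we can reuse our rmse() helper on this leaner model's test predictions (a quick check, not in the original notebook) and compare it with the roughly 0.6 we got from the full model:

# RMSE of the fourth model on its test set.
predictions4 = results4.predict(X_test)
print(rmse(predictions4, y_test))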
Congratulations on getting through this implementation of regression and good data science practice in Python!
Take a moment to reflect on which model was the best, before reading on.
. . .
Here's one conclusion that seems right. While our most predictively powerful model was the third (results3), its explanatory variables were correlated with one another, which introduced some redundancy. Our most elegant and economical model was the fourth (results4): it used just a few predictors to get a good result.
All of our models in this notebook have used the OLS algorithm: Ordinary Least Squares. There are many other regression algorithms, and if you have time, it would be good to investigate them; sklearn's linear_model module alone offers several, such as Ridge and Lasso. Be sure to make a note of what you find, and talk it through with your mentor at your next call.
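As one example (a sketch, assuming the fourth model's X_train, X_test, y_train, and y_test are still in scope), sklearn's Ridge adds an L2 penalty to ordinary least squares:

# Ridge fits its own intercept, so we drop the 'const' column statsmodels needed.
ridge = linear_model.Ridge(alpha=1.0)
ridge.fit(X_train.drop(columns='const'), y_train)
print(ridge.score(X_test.drop(columns='const'), y_test))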